Abstract

We propose a new optimization algorithm for Multiple Kernel Learning (MKL) with general
convex loss functions. The proposed algorithm is a proximal minimization method that
utilizes the "smoothed" dual objective function and converges super-linearly. The sparsity
of the intermediate solution plays a crucial role in the efficiency of the proposed algorithm.
Consequently, our algorithm scales well with an increasing number of kernels. Experimental
results show that our algorithm compares favorably with existing methods, especially when the
number of kernels is large (> 1000).

1 Introduction



Kernel methods are powerful nonparametric methods in machine learning and data analysis.
Typically a kernel method has a decision function that lies in some Reproducing Kernel Hilbert
Space (RKHS). In such a learning framework, the choice of a kernel has a strong impact on the
performance of a method. Instead of using a single kernel, Multiple Kernel Learning (MKL)
aims to find an optimal combination of multiple kernels. In fact, it has been reported [11]
that using multiple kernels improves performance in learning tasks that involve multiple and
heterogeneous data sources. More specifically, MKL fits a decision function of the form
$f(x) = \sum_{m=1}^{M} f_m(x) + b$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$
($m = 1, \dots, M$) corresponding to a different basis kernel $k_m$. Each basis kernel $k_m$ may be
constructed on a different feature subset of the input $x$, or from a different kernel type (e.g.,
Gaussian, polynomial) with different parameter values (e.g., Gaussian width, polynomial order),
or may even rely on different heterogeneous data sources associated with the same learning
problem. This provides considerable flexibility to fit various types of problems. According to
recent formulations [1, 17, 13], MKL selects the decision function as the minimizer of the
following optimization problem:

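$$\underset{f_m \in \mathcal{H}_m,\; b \in \mathbb{R}}{\text{minimize}} \quad \sum_{i=1}^{N} \ell\Big(y_i, \sum_{m=1}^{M} f_m(x_i) + b\Big) + C \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m} \tag{1}$$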

where $\{x_i, y_i\}_{i=1}^{N}$ are labeled training examples, $\ell(\cdot, \cdot)$ is a convex loss function
(e.g., hinge, logistic, squared loss) and $\|\cdot\|_{\mathcal{H}_m}$ is the norm in the RKHS
$\mathcal{H}_m$.^1 A nice property of the above formulation is that the solution becomes sparse due
to the mixed norm penalization $\sum_m \|f_m\|_{\mathcal{H}_m}$, similarly to the group Lasso [22, 2].
Thus we can select kernels within a convex optimization problem. The resulting decision function
is nicely interpretable because only a small number of the $f_m$ are used. However, solving MKL is
challenging because the objective function is non-differentiable due to the non-smoothness of the
regularization term. To overcome this difficulty, several methods have been proposed.

^1 Note that in the literature [17, 13, 5], a different but equivalent regularization term has been
considered. See Sec. 4 for details.

Roughly speaking, two types of methods have been proposed so far. The first are
constraints-based methods [1] that cast the problem as a constrained convex optimization problem.
Lanckriet et al. [11] formulated MKL as a semi-definite programming (SDP) problem. Bach et al. [1]
cast the problem as a second order conic programming (SOCP) problem and proposed an SMO-like
algorithm to deal with medium-scale problems. The second are upper-bound-based methods
[17, 13, 5]. These methods upper-bound the objective function by a smooth function with some
auxiliary variables; they iteratively (a) solve a single kernel learning problem, such as SVM, and
(b) update the auxiliary variables. A nice property of this type of method is that it can make
use of existing well-tuned solvers for the single kernel problem. The Semi-Infinite Linear Program
(SILP) approach proposed by [17] utilizes a cutting plane method for the update of the auxiliary
variables. SimpleMKL proposed by [13] performs gradient descent on the auxiliary variables. It
was reported that SimpleMKL converges faster than the former methods. Xu et al. [20] proposed a
novel Level Method as an improvement of SILP and SimpleMKL. HessianMKL proposed by [5] replaced
the gradient descent update of SimpleMKL with a Newton update. At each iteration, HessianMKL
solves a Quadratic Programming (QP) problem whose size is the number of kernels to obtain
the Newton update direction. Therefore it is efficient when the number of kernels is small.

In this article, we propose a new efficient MKL algorithm, which we call SpicyMKL. The
proposed method computes descent steps through the optimization of a smoothed dual objective
function, which arises from a proximal minimization [15] in the primal. From the general theory
of the proximal minimization method, the proposed method converges super-linearly. The primal
variable is sparse at each iteration due to the so-called soft threshold operation [8, 7, 6, 21];
this sparsity is effectively exploited in the proposed algorithm. Therefore SpicyMKL scales well
with an increasing number of kernels. Numerical experiments show that we are able to train a
classifier with 3000 kernels in less than 10 seconds.



2 Framework of MKL 

In the MKL problem, we assume that we are given $N$ samples $(x_i, y_i)_{i=1}^{N}$ where $x_i$
belongs to an input space $\mathcal{X}$ and $y_i$ belongs to an output space $\mathcal{Y}$ (usual
settings are $\mathcal{Y} = \{\pm 1\}$ for classification and $\mathcal{Y} = \mathbb{R}$ for
regression). We define the Gram matrix with respect to the kernel function $k_m$ as
$K_m = (k_m(x_i, x_j))_{i,j}$. We assume the Gram matrix $K_m$ is positive definite.^2 The inner
product induced by a positive definite matrix $K \in \mathbb{R}^{N \times N}$ is written as
$\langle \alpha, \beta \rangle_K := \alpha^\top K \beta$ for $\alpha, \beta \in \mathbb{R}^N$, and the
norm induced by this inner product is written as $\|\alpha\|_K := \sqrt{\langle \alpha, \alpha \rangle_K}$.

MKL fits the decision function of the form $f(x) + b = \sum_{m=1}^{M} f_m(x) + b$ as the minimizer
of Eq. (1), where each $f_m$ is an element of a different RKHS $\mathcal{H}_m$. By the representer
theorem [10], the optimal solution of Eq. (1) is attained in the form
$f_m(x) = \sum_{i=1}^{N} k_m(x, x_i)\,\alpha_{m,i}$. If we write
$\alpha_m = (\alpha_{m,1}, \dots, \alpha_{m,N})^\top$,
$\alpha = (\alpha_1^\top, \dots, \alpha_M^\top)^\top \in \mathbb{R}^{NM}$ and
$K = (K_1, \dots, K_M) \in \mathbb{R}^{N \times NM}$, then the optimization problem Eq. (1) is
reduced to the following finite dimensional optimization problem:

$$\underset{\alpha \in \mathbb{R}^{NM},\; b \in \mathbb{R}}{\text{minimize}} \quad \sum_{i=1}^{N} \ell\big(y_i, (K\alpha)_i + b\big) + C \sum_{m=1}^{M} \|\alpha_m\|_{K_m}$$

where $(\cdot)_i$ represents the $i$-th element of a vector. Here the loss function $\ell(y, f)$ may
be taken as a hinge loss $\max(1 - yf, 0)$ or a logistic loss $\log(1 + \exp(-yf))$ for a
classification problem, or a squared loss $(y - f)^2$ or an SVR loss $\max(|y - f| - \epsilon, 0)$
for a regression problem. For simplicity we rewrite the above problem as

^2 To avoid numerical instability, we added 10 to diagonal elements of Km in the numerical experiments.

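$$\text{(P)} \qquad \underset{\alpha \in \mathbb{R}^{NM},\; b \in \mathbb{R}}{\text{minimize}} \quad f_\ell(K\alpha + b\mathbf{1}) + \phi_{CK}(\alpha), \tag{2}$$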

where $\mathbf{1} = (1, \dots, 1)^\top$ and

$$f_\ell(z) = \sum_{i=1}^{N} \ell(y_i, z_i), \qquad \phi_{CK}(\alpha) = \sum_{m=1}^{M} \phi_{CK_m}(\alpha_m) = C \sum_{m=1}^{M} \|\alpha_m\|_{K_m}.$$

According to [1, 17], it can be shown that the optimal solution of (P) has the form
$f^*(x) + b^* = \sum_{i=1}^{N} \alpha_i^* \big(\sum_{m=1}^{M} d_m k_m(x, x_i)\big) + b^*$, where
$0 \le d_m \le 1$ and $\sum_{m=1}^{M} d_m = 1$.
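To make the finite dimensional problem (P) concrete, the following minimal sketch evaluates its objective for block-wise coefficients. It assumes the logistic loss and stores Gram matrices as plain lists of rows; the helper names (`mat_vec`, `k_norm`, `mkl_objective`) and the toy Gram matrices are illustrative and not part of the original paper.

```python
import math

def mat_vec(K, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def k_norm(K, v):
    """Norm induced by K: sqrt(v^T K v)."""
    return math.sqrt(max(sum(a * b for a, b in zip(v, mat_vec(K, v))), 0.0))

def logistic_loss(y, f):
    """log(1 + exp(-y f)), evaluated without overflow."""
    t = -y * f
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def mkl_objective(Ks, alphas, b, y, C, loss=logistic_loss):
    """f_l(K alpha + b 1) + C * sum_m ||alpha_m||_{K_m} for block-wise alphas."""
    z = [b] * len(y)
    for K, a in zip(Ks, alphas):
        z = [zi + ki for zi, ki in zip(z, mat_vec(K, a))]
    data_term = sum(loss(yi, zi) for yi, zi in zip(y, z))
    penalty = C * sum(k_norm(K, a) for K, a in zip(Ks, alphas))
    return data_term + penalty

# Toy problem: N = 2 samples, M = 2 hypothetical Gram matrices.
Ks = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 2.0]]]
y = [1, -1]
at_zero = mkl_objective(Ks, [[0.0, 0.0], [0.0, 0.0]], 0.0, y, C=0.5)
one_block = mkl_objective(Ks, [[3.0, 4.0], [0.0, 0.0]], 0.0, y, C=0.5)
```

Only the blocks with nonzero coefficients contribute to the penalty, which is the block-wise sparsity that the mixed norm encourages.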



3 An augmented Lagrangian method for MKL 

In this section, we first introduce our method as a proximal minimization method. Second, 
we assume that the loss function is twice differentiable and derive a Newton method for the 
inner minimization. Finally, the method is extended to the case where the loss function is
non-differentiable.



3.1 Dual augmented Lagrangian method as a proximal minimization method: 

The minimization problem (P) is convex but non-differentiable. We apply the
proximal minimization method [15] to our problem (P) to obtain a new variant of the dual
augmented Lagrangian method proposed in [18] (see also [9, 12, 3]). The proximal minimization
method converts the problem (P) into a sequence of "smoothed" minimization problems as
follows:



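$$(\alpha^{(t+1)}, b^{(t+1)}) = \underset{\alpha \in \mathbb{R}^{NM},\; b \in \mathbb{R}}{\operatorname{argmin}} \Bigg\{ f_\ell(K\alpha + b\mathbf{1}) + \phi_{CK}(\alpha) + \sum_{m=1}^{M} \frac{\|\alpha_m - \alpha_m^{(t)}\|_{K_m}^2}{2\gamma_m^{(t)}} + \frac{(b - b^{(t)})^2}{2\gamma_b^{(t)}} \Bigg\}, \tag{3}$$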

where $0 < \gamma_m^{(1)} \le \gamma_m^{(2)} \le \cdots$ and $0 < \gamma_b^{(1)} \le \gamma_b^{(2)} \le \cdots$
are nondecreasing sequences of penalty parameters and $(\alpha^{(t)}, b^{(t)})$ is an approximate
minimizer at the $t$-th iteration. Starting from some initial solution $(\alpha^{(0)}, b^{(0)})$, it is
known that the sequence $\alpha^{(t)}$ ($t = 0, 1, 2, \dots$) converges to the minimum of (P) at a rate
roughly proportional to $1/\min(\gamma_m^{(t)}, \gamma_b^{(t)})$; thus when the sequences of penalty
parameters go to infinity the proposed algorithm converges super-linearly [15]. In order to carry
out the minimization in Eq. (3) in practice, we use the Lagrangian duality [1] and rewrite Eq. (3)
as a min-max problem (see also [15]):



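$$\min_{\alpha \in \mathbb{R}^{NM},\, b \in \mathbb{R}} \; \max_{\rho \in \mathbb{R}^{N},\, u \in \mathbb{R}^{NM}} \Bigg\{ -f_\ell^*(-\rho) - \sum_{m=1}^{M} \phi^*_{CK_m}(K_m u_m) - b \sum_{i=1}^{N} \rho_i - \sum_{m=1}^{M} \alpha_m^\top K_m (\rho - u_m) + \sum_{m=1}^{M} \frac{\|\alpha_m - \alpha_m^{(t)}\|_{K_m}^2}{2\gamma_m^{(t)}} + \frac{(b - b^{(t)})^2}{2\gamma_b^{(t)}} \Bigg\}, \tag{4}$$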

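where $f_\ell^*$ and $\phi^*_{CK_m}$ are the convex conjugate functions of $f_\ell$ and $\phi_{CK_m}$
(see [1]), and we define $u = (u_1^\top, \dots, u_M^\top)^\top \in \mathbb{R}^{NM}$ with
$u_m \in \mathbb{R}^N$ ($m = 1, \dots, M$). It is easy to verify that the inner maximization yields
the first two terms in Eq. (3). Now we can exchange the order of minimization and maximization
because the function to be min-maxed in the above equation is convex in $(\alpha, b)$ and concave
in $(\rho, u)$ (see Chapter 36 of [14]). By minimizing Eq. (4) with respect to $(\alpha, b)$ and
maximizing it with respect to $u_m$, we obtain the following update equations (see Appendix A for
the derivation):

$$\alpha_m^{(t+1)} = \mathrm{ST}^{m}_{\gamma_m^{(t)} C}\big(\alpha_m^{(t)} + \gamma_m^{(t)} \rho^{(t)}\big) \quad (m = 1, \dots, M), \tag{5}$$

$$b^{(t+1)} = b^{(t)} + \gamma_b^{(t)} \sum_{i=1}^{N} \rho_i^{(t)}, \tag{6}$$

where the well known soft thresholding function (see [8, 7, 6, 21]) $\mathrm{ST}^{m}_{C}$ is defined
for the MKL problem as follows:

$$\mathrm{ST}^{m}_{C}(v) = \frac{\max(\|v\|_{K_m} - C,\, 0)}{\|v\|_{K_m}}\, v,$$

and $\rho^{(t)} \in \mathbb{R}^N$ is the minimizer of the function
$\varphi_{\gamma^{(t)}}(\rho; \alpha^{(t)}, b^{(t)})$ defined as follows:

$$\varphi_{\gamma^{(t)}}(\rho; \alpha^{(t)}, b^{(t)}) = f_\ell^*(-\rho) + \sum_{m=1}^{M} \frac{1}{2\gamma_m^{(t)}} \big\|\mathrm{ST}^{m}_{\gamma_m^{(t)} C}(\alpha_m^{(t)} + \gamma_m^{(t)} \rho)\big\|_{K_m}^2 + \frac{1}{2\gamma_b^{(t)}} \Big(b^{(t)} + \gamma_b^{(t)} \sum_{i=1}^{N} \rho_i\Big)^2, \tag{7}$$

where $\gamma = (\gamma_1, \dots, \gamma_M, \gamma_b)^\top \in \mathbb{R}_+^{M+1}$. At every iteration we
minimize $\varphi_{\gamma^{(t)}}(\rho; \alpha^{(t)}, b^{(t)})$ with respect to $\rho$ and use the
minimizer $\rho^{(t)}$ in the update rules (Eqs. (5) and (6)). The overall algorithm is shown in
Table 1. We call the proposed algorithm Sparse Iterative MKL (SpicyMKL). The above update
equations (5)-(7) exactly correspond to the augmented Lagrangian method for the dual of (P)
(see [18]) but are derived in a simpler way using the techniques from [15].

Table 1: Algorithm of SpicyMKL

1. Choose sequences $\gamma_m^{(t)} \to \infty$ ($m = 1, \dots, M$) and $\gamma_b^{(t)} \to \infty$ as $t \to \infty$.

2. Minimize the augmented Lagrangian with respect to $\rho$:
   $\rho^{(t)} = \operatorname{argmin}_{\rho} \Big( f_\ell^*(-\rho) + \sum_m \frac{1}{2\gamma_m^{(t)}} \|\mathrm{ST}^{m}_{\gamma_m^{(t)} C}(\alpha_m^{(t)} + \gamma_m^{(t)} \rho)\|_{K_m}^2 + \frac{1}{2\gamma_b^{(t)}} (b^{(t)} + \gamma_b^{(t)} \sum_i \rho_i)^2 \Big)$

3. Update $\alpha_m^{(t+1)} \leftarrow \mathrm{ST}^{m}_{\gamma_m^{(t)} C}(\alpha_m^{(t)} + \gamma_m^{(t)} \rho^{(t)})$ and $b^{(t+1)} \leftarrow b^{(t)} + \gamma_b^{(t)} \sum_i \rho_i^{(t)}$.

4. Repeat 2. and 3. until the stopping criterion is satisfied.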
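The soft thresholding operation used in the update of each coefficient block can be sketched as follows. This is a minimal illustration, assuming Gram matrices stored as lists of rows; the function names are hypothetical and the threshold `lam` plays the role of $\gamma_m C$.

```python
import math

def mat_vec(K, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def k_norm(K, v):
    """Norm induced by K: sqrt(v^T K v)."""
    return math.sqrt(max(sum(a * b for a, b in zip(v, mat_vec(K, v))), 0.0))

def soft_threshold(v, K, lam):
    """ST_lam(v) = max(||v||_K - lam, 0) * v / ||v||_K."""
    norm = k_norm(K, v)
    if norm <= lam:
        # The whole block is set to zero: this is where sparsity comes from.
        return [0.0] * len(v)
    scale = (norm - lam) / norm
    return [scale * vi for vi in v]

K = [[2.0, 0.0], [0.0, 2.0]]
v = [3.0, 4.0]                       # ||v||_K = sqrt(50)
shrunk = soft_threshold(v, K, lam=2.0)
zeroed = soft_threshold(v, K, lam=8.0)
```

The operator shrinks the norm of a block by exactly `lam` and removes the block entirely once its norm falls below the threshold, so kernels drop out of the intermediate solution as whole groups.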

3.2 Minimizing the augmented Lagrangian function: 

Note that the augmented Lagrangian (AL) function $\varphi_\gamma(\rho; \alpha, b)$ (Eq. (7)) that we
need to minimize at every iteration is convex and differentiable. This minimization can be carried
out using standard techniques such as the Newton method or the quasi-Newton method. We use the
Newton method because we can exploit the sparsity in the intermediate solution in the computation
of the gradient and the Hessian of the objective function. The case where $\ell^*(y, \cdot)$ is
non-differentiable is discussed in the next subsection. Let $v_m = \alpha_m + \gamma_m \rho$. If the
conjugate loss function $f_\ell^*$ is twice differentiable (more specifically if $\ell^*(y, \cdot)$
is so), the gradient and the Hessian of $\varphi_\gamma(\rho; \alpha, b)$ can be written as follows:

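$$\nabla_\rho \varphi_\gamma(\rho; \alpha, b) = \nabla_\rho f_\ell^*(-\rho) + \sum_{m \in M_+} K_m \mathrm{ST}^{m}_{\gamma_m C}(\alpha_m + \gamma_m \rho) + \Big(b + \gamma_b \sum_{i=1}^{N} \rho_i\Big) \mathbf{1}, \tag{8}$$

$$\nabla_\rho^2 \varphi_\gamma(\rho; \alpha, b) = \nabla_\rho^2 f_\ell^*(-\rho) + \sum_{m \in M_+} \gamma_m \Big( (1 - q_m) K_m + q_m K_m \tilde{v}_m \tilde{v}_m^\top K_m \Big) + \gamma_b \mathbf{1}\mathbf{1}^\top, \tag{9}$$

where $M_+$ is the set of indices such that $\|v_m\|_{K_m} > \gamma_m C$,
$q_m = \gamma_m C / \|v_m\|_{K_m}$, and $\tilde{v}_m = v_m / \|v_m\|_{K_m}$ ($m \in M_+$).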

Remark 1. The computation of the gradient and the Hessian of $\varphi_\gamma(\rho; \alpha, b)$ is
efficient because they require only the terms corresponding to the active kernels, i.e., the set of
$m$ such that $\|\alpha_m + \gamma_m \rho\|_{K_m} > \gamma_m C$.
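The role of the active set in the gradient can be illustrated with a short sketch. It assumes the logistic loss (whose conjugate is the negative entropy) and list-based Gram matrices; `al_value`, `al_gradient` and the toy numbers are illustrative choices rather than the authors' implementation. Kernels outside the active set are skipped entirely when the gradient is accumulated.

```python
import math

def mat_vec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def k_norm(K, v):
    return math.sqrt(max(sum(a * b for a, b in zip(v, mat_vec(K, v))), 0.0))

def al_value(rho, Ks, alphas, b, y, C, gammas, gamma_b):
    """AL function with the logistic conjugate; needs 0 < y_i * rho_i < 1."""
    val = 0.0
    for yi, ri in zip(y, rho):
        u = yi * ri
        val += u * math.log(u) + (1.0 - u) * math.log(1.0 - u)
    for K, a, g in zip(Ks, alphas, gammas):
        v = [ai + g * ri for ai, ri in zip(a, rho)]
        val += max(k_norm(K, v) - g * C, 0.0) ** 2 / (2.0 * g)
    return val + (b + gamma_b * sum(rho)) ** 2 / (2.0 * gamma_b)

def al_gradient(rho, Ks, alphas, b, y, C, gammas, gamma_b):
    """Gradient of al_value; only active kernels contribute to the sum."""
    grad = [yi * math.log(yi * ri / (1.0 - yi * ri)) for yi, ri in zip(y, rho)]
    active = []
    for m, (K, a, g) in enumerate(zip(Ks, alphas, gammas)):
        v = [ai + g * ri for ai, ri in zip(a, rho)]
        norm = k_norm(K, v)
        if norm > g * C:                       # m belongs to the active set
            active.append(m)
            scale = (norm - g * C) / norm      # soft-thresholded v is scale * v
            k_st = mat_vec(K, [scale * vi for vi in v])
            grad = [gi + ki for gi, ki in zip(grad, k_st)]
    shift = b + gamma_b * sum(rho)
    return [gi + shift for gi in grad], active

Ks = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 1.0], [1.0, 2.0]]]
alphas = [[0.1, -0.2], [0.0, 0.0]]
y, rho = [1, -1], [0.3, -0.4]
args = (Ks, alphas, 0.1, y, 0.6, [1.0, 1.0], 1.0)
grad, active = al_gradient(rho, *args)
```

In this toy setting only the first kernel exceeds its threshold, so the second Gram matrix is never multiplied; the same saving applies to the Hessian.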

Note that the domain of $\ell^*(y, \cdot)$ may be some closed interval in $\mathbb{R}$ as long as we
know that the minimum of the AL function $\varphi_\gamma(\rho; \alpha, b)$ is not attained at the
boundary of the domain. This is for example the case for the logistic loss function, whose
conjugate is the negative entropy function
$\ell^*(y_i, -\rho_i) = (y_i \rho_i) \log(y_i \rho_i) + (1 - y_i \rho_i) \log(1 - y_i \rho_i)$. In this
case, the violation of the constraint can be easily prevented by the line search performed at each
Newton iteration. The case where the minimum is typically attained at the boundary (e.g., the
hinge loss) is handled in the next subsection.
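The negative entropy form of the logistic conjugate can be checked numerically. The sketch below compares the closed form with a brute-force evaluation of $\sup_f \{-\rho f - \ell(y, f)\}$ on a grid; the grid bounds and function names are arbitrary choices made for this illustration.

```python
import math

def logistic_loss(y, f):
    """log(1 + exp(-y f)), evaluated without overflow."""
    t = -y * f
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def xlogx(u):
    """u * log(u) with the convention 0 * log(0) = 0."""
    return 0.0 if u == 0.0 else u * math.log(u)

def logistic_conjugate(y, rho):
    """l*(y, -rho): negative entropy of u = y * rho on the interval [0, 1]."""
    u = y * rho
    if not 0.0 <= u <= 1.0:
        return math.inf            # outside the domain of the conjugate
    return xlogx(u) + xlogx(1.0 - u)

def conjugate_by_grid(y, rho, lo=-20.0, hi=20.0, steps=40001):
    """sup_f { -rho * f - loss(y, f) } approximated on a grid (check only)."""
    best = -math.inf
    for k in range(steps):
        f = lo + (hi - lo) * k / (steps - 1)
        best = max(best, -rho * f - logistic_loss(y, f))
    return best

closed_form = logistic_conjugate(1, 0.3)
numeric = conjugate_by_grid(1, 0.3)
```

The conjugate is finite only for $0 \le y_i \rho_i \le 1$, which is the interval constraint that the line search has to respect.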



3.3 Explicitly handling boundary constraints: 

The Newton method with line search described in the last section is unsuitable when the
conjugate loss function $\ell^*(y, \cdot)$ has a non-differentiable point in the interior of its
domain or has a finite gradient at the boundary of its domain. We use the same augmented
Lagrangian technique for these cases. More specifically, we introduce additional primal variables
so that the AL function $\varphi_\gamma(\cdot; \alpha, b)$ becomes differentiable. We explain this in
the case of the hinge loss for classification. Generalization to other cases is straightforward,
but we omit the details for lack of space. To this end, we introduce two sets of slack variables
$\xi = (\xi_1, \dots, \xi_N)^\top \ge 0$ and $\zeta = (\zeta_1, \dots, \zeta_N)^\top \ge 0$ as in the
standard SVM literature (see e.g., [16]). The basic update equation (Eq. (3)) is rewritten
as follows^3:



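$$(\alpha^{(t+1)}, b^{(t+1)}, \xi^{(t+1)}, \zeta^{(t+1)}) = \underset{\alpha \in \mathbb{R}^{NM},\, b \in \mathbb{R},\, \xi \in \mathbb{R}_+^N,\, \zeta \in \mathbb{R}_+^N}{\operatorname{argmin}} \Bigg\{ \tilde{f}(\alpha, b, \xi, \zeta) + \sum_{m=1}^{M} \frac{\|\alpha_m - \alpha_m^{(t)}\|_{K_m}^2}{2\gamma_m^{(t)}} + \frac{(b - b^{(t)})^2}{2\gamma_b^{(t)}} + \frac{\|\xi - \xi^{(t)}\|^2}{2\gamma_\xi^{(t)}} + \frac{\|\zeta - \zeta^{(t)}\|^2}{2\gamma_\zeta^{(t)}} \Bigg\},$$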



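where

$$\tilde{f}(\alpha, b, \xi, \zeta) = \begin{cases} \sum_{i=1}^{N} \xi_i + \phi_{CK}(\alpha) & \text{if } y_i\big((\sum_{m=1}^{M} K_m \alpha_m)_i + b\big) = 1 - \xi_i + \zeta_i \;\; (\forall i), \\ +\infty & \text{otherwise}. \end{cases}$$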
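This function $\tilde{f}$ can again be expressed in terms of a maximum over $\rho \in \mathbb{R}^N$,
$u \in \mathbb{R}^{NM}$ as follows:

$$\tilde{f}(\alpha, b, \xi, \zeta) = \max_{\rho \in \mathbb{R}^N,\, u \in \mathbb{R}^{NM}} \Bigg\{ \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} (-y_i \rho_i) - \sum_{m=1}^{M} \phi^*_{CK_m}(K_m u_m) - b \sum_{i=1}^{N} \rho_i - \sum_{m=1}^{M} \alpha_m^\top K_m (\rho - u_m) - \sum_{i=1}^{N} \xi_i y_i \rho_i + \sum_{i=1}^{N} \zeta_i y_i \rho_i \Bigg\}.$$

^3 $\mathbb{R}_+$ is the set of non-negative real numbers.

We exchange the order of minimization and maximization as before and remove $\alpha$, $b$, $\xi$,
$\zeta$ and $u_m$ by explicitly minimizing or maximizing over them (see also Appendix A). Finally
we obtain the following update equations.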

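$$\alpha_m^{(t+1)} = \mathrm{ST}^{m}_{\gamma_m^{(t)} C}\big(\alpha_m^{(t)} + \gamma_m^{(t)} \rho^{(t)}\big), \qquad b^{(t+1)} = b^{(t)} + \gamma_b^{(t)} \sum_{i=1}^{N} \rho_i^{(t)}, \tag{10}$$

$$\xi_i^{(t+1)} = \max\big(0,\, \xi_i^{(t)} - \gamma_\xi^{(t)} (1 - y_i \rho_i^{(t)})\big), \qquad \zeta_i^{(t+1)} = \max\big(0,\, \zeta_i^{(t)} - \gamma_\zeta^{(t)} y_i \rho_i^{(t)}\big), \tag{11}$$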

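where $\rho^{(t)} \in \mathbb{R}^N$ is the minimizer of the function
$\varphi_{\gamma^{(t)}}(\rho; \alpha^{(t)}, b^{(t)}, \xi^{(t)}, \zeta^{(t)})$ defined as follows:

$$\varphi_\gamma(\rho; \alpha, b, \xi, \zeta) = -\sum_{i=1}^{N} y_i \rho_i + \sum_{m=1}^{M} \frac{1}{2\gamma_m} \big\|\mathrm{ST}^{m}_{\gamma_m C}(\alpha_m + \gamma_m \rho)\big\|_{K_m}^2 + \frac{1}{2\gamma_b} \Big(b + \gamma_b \sum_{i=1}^{N} \rho_i\Big)^2 + \frac{1}{2\gamma_\xi} \sum_{i=1}^{N} \max\big(0,\, \xi_i - \gamma_\xi (1 - y_i \rho_i)\big)^2 + \frac{1}{2\gamma_\zeta} \sum_{i=1}^{N} \max\big(0,\, \zeta_i - \gamma_\zeta y_i \rho_i\big)^2, \tag{12}$$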

and $\gamma = (\{\gamma_m\}_{m=1}^{M}, \gamma_b, \gamma_\xi, \gamma_\zeta)^\top \in \mathbb{R}_+^{M+3}$. The
gradient and the Hessian of $\varphi_\gamma$ with respect to $\rho$ can be obtained in a similar way
to Eqs. (8) and (9). Thus we apply the Newton method. The overall algorithm is analogous to
Table 1 with update equations (10)-(12).
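The slack variable updates of Eq. (11) are simple clipped gradient-like steps. A minimal sketch, with hypothetical function and argument names, is:

```python
def update_slacks(xi, zeta, rho, y, gamma_xi, gamma_zeta):
    """One update of the hinge-loss slack variables following Eq. (11)."""
    new_xi = [max(0.0, x - gamma_xi * (1.0 - yi * ri))
              for x, yi, ri in zip(xi, y, rho)]
    new_zeta = [max(0.0, z - gamma_zeta * yi * ri)
                for z, yi, ri in zip(zeta, y, rho)]
    return new_xi, new_zeta

# Toy values: y_i * rho_i equals 1.0 for the first sample and 0.25 for the second.
new_xi, new_zeta = update_slacks(
    xi=[0.5, 0.0], zeta=[0.0, 0.5], rho=[1.0, -0.25], y=[1, -1],
    gamma_xi=1.0, gamma_zeta=1.0,
)
```

Both updates project onto the non-negative orthant, so the slack variables stay feasible at every iteration.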

3.4 Technical details of computations: 

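We used Armijo's rule to find a step size of the Newton method. During the backtracking to find
the step size, the computational bottleneck is the computation of the norms
$\|\rho + c\Delta\rho + \alpha_m/\gamma_m\|_{K_m}$ ($m = 1, \dots, M$), where $\Delta\rho$ is the Newton
update direction and $0 < c \le 1$ is a step size. However, this computation is needed only on the
active kernels
$\{m \mid \|\rho + \alpha_m/\gamma_m\|_{K_m} > C \text{ or } \|\rho + \Delta\rho + \alpha_m/\gamma_m\|_{K_m} > C\}$
because of the convexity of $\|\cdot\|_{K_m}$: if both endpoint norms are at most $C$, then so is the
norm at every intermediate step size. This considerably reduces the amount of computation.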
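The screening argument above can be sketched in a few lines. The code assumes list-based Gram matrices and uses hypothetical names; it returns the kernels whose norms have to be re-evaluated during backtracking and leaves all other kernels untouched.

```python
import math

def mat_vec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def k_norm(K, v):
    return math.sqrt(max(sum(a * b for a, b in zip(v, mat_vec(K, v))), 0.0))

def kernels_to_track(rho, delta, Ks, alphas, gammas, C):
    """Indices m whose norm must be tracked for step sizes c in [0, 1]."""
    tracked = []
    for m, (K, a, g) in enumerate(zip(Ks, alphas, gammas)):
        start = [ri + ai / g for ri, ai in zip(rho, a)]    # c = 0
        end = [si + di for si, di in zip(start, delta)]    # c = 1
        # By convexity of the norm, a kernel that is inactive at both
        # endpoints stays inactive for every c in between.
        if k_norm(K, start) > C or k_norm(K, end) > C:
            tracked.append(m)
    return tracked

Ks = [[[1.0, 0.0], [0.0, 1.0]], [[0.01, 0.0], [0.0, 0.01]]]
alphas = [[0.0, 0.0], [0.0, 0.0]]
rho, delta, C = [0.3, 0.4], [0.3, 0.0], 0.45
tracked = kernels_to_track(rho, delta, Ks, alphas, [1.0, 1.0], C)
```

Only two norm evaluations per kernel are needed up front; the inner backtracking loop then touches the tracked kernels alone.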

4 Relations to existing methods 

4.1 Iterative Shrinkage/Thresholding: 

Another approach to minimize Eq. (3) is to linearly approximate the loss term
$f_\ell(K\alpha + b\mathbf{1})$ as follows:

$$f_\ell(K\alpha + b\mathbf{1}) \simeq f_\ell(z^{(t)}) + \nabla_z f_\ell(z^{(t)})^\top \big(K(\alpha - \alpha^{(t)}) + (b - b^{(t)})\mathbf{1}\big),$$

where $z^{(t)} = K\alpha^{(t)} + b^{(t)}\mathbf{1}$. Minimization over $\alpha$ and $b$ yields the
following update equations:

$$\alpha_m^{(t+1)} = \mathrm{ST}^{m}_{\gamma_m^{(t)} C}\big(\alpha_m^{(t)} - \gamma_m^{(t)} \nabla_z f_\ell(z^{(t)})\big), \qquad b^{(t+1)} = b^{(t)} - \gamma_b^{(t)} \sum_{i=1}^{N} \big(\nabla_z f_\ell(z^{(t)})\big)_i.$$

This is equivalent to the popular iterative shrinkage/thresholding (IST) algorithm ([8, 7, 6, 19],
see also [21]) generalized to the MKL setting. Thus the proposed SpicyMKL can be considered
as the exact version of the proximal minimization method (Eq. (3)) whereas the IST approach
approximately minimizes Eq. (3).
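One IST step for the MKL setting can be sketched as below. The sketch assumes the logistic loss and list-based Gram matrices; all function names and the toy problem are illustrative and not taken from the paper.

```python
import math

def mat_vec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def k_norm(K, v):
    return math.sqrt(max(sum(a * b for a, b in zip(v, mat_vec(K, v))), 0.0))

def soft_threshold(v, K, lam):
    norm = k_norm(K, v)
    if norm <= lam:
        return [0.0] * len(v)
    return [(norm - lam) / norm * vi for vi in v]

def logistic_grad(y, z):
    """Gradient of sum_i log(1 + exp(-y_i z_i)) with respect to z."""
    return [-yi / (1.0 + math.exp(yi * zi)) for yi, zi in zip(y, z)]

def ist_step(Ks, alphas, b, y, C, gammas, gamma_b):
    """One iterative shrinkage/thresholding update of (alpha, b)."""
    z = [b] * len(y)
    for K, a in zip(Ks, alphas):
        z = [zi + ki for zi, ki in zip(z, mat_vec(K, a))]
    g = logistic_grad(y, z)
    new_alphas = [
        soft_threshold([ai - gm * gi for ai, gi in zip(a, g)], K, gm * C)
        for K, a, gm in zip(Ks, alphas, gammas)
    ]
    new_b = b - gamma_b * sum(g)
    return new_alphas, new_b

Ks = [[[1.0, 0.0], [0.0, 1.0]], [[0.01, 0.0], [0.0, 0.01]]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
new_alphas, new_b = ist_step(Ks, zeros, 0.0, [1, -1], 0.5, [1.0, 1.0], 1.0)
```

In contrast to SpicyMKL, this step uses only the gradient of the loss at the current point, so each iteration is cheap but the proximal subproblem of Eq. (3) is solved only approximately.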

4.2 Correspondence of regularization terms: 

The regularization term in SILP, SimpleMKL and HessianMKL is defined by
$\frac{1}{2}\big(\sum_m \|f_m\|_{\mathcal{H}_m}\big)^2$ instead of $C \sum_m \|f_m\|_{\mathcal{H}_m}$ as in
our formulation (see Eq. (1)). However, the two formulations are equivalent because the minimizer
$\{f_m^*\}_{m=1}^{M}$ of our formulation with the regularization parameter $C$ also minimizes the
objective function of SimpleMKL with the regularization parameter $C'$ that is defined by

c' = c{Y.'Li\\rj\'Hj. (13) 

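This is given by the relation
$\nabla_{f_m} \frac{1}{2}\big(\sum_m \|f_m\|_{\mathcal{H}_m}\big)^2 = \big(\sum_m \|f_m\|_{\mathcal{H}_m}\big) \nabla_{f_m} \|f_m\|_{\mathcal{H}_m}$,
where $\nabla_{f_m}$ is a subdifferential with respect to $f_m$.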

5 Numerical experiments 

In this section, we experimentally investigate the performance of the proposed method and
existing MKL methods using several datasets of binary classification tasks^4. We compared our
algorithm SpicyMKL to SimpleMKL [13] and HessianMKL [5].

5.1 Performances on UCI benchmark datasets 

The experimental settings were borrowed from the SimpleMKL paper [13], but we used a larger
number of kernels. We used 5 datasets from the UCI repository: 'Liver', 'Pima', 'Ionosphere',
'Wpbc', 'Sonar'. The candidate kernels were Gaussian kernels with 24 different bandwidths (0.1,
0.25, 0.5, 0.75, 1, 2, 3, 4, ..., 19, 20) and polynomial kernels of degree 1 to 3. All of the 27
different kernel functions (Gaussian kernels with different bandwidths and polynomial kernels of
degrees 1 to 3) were applied to individual variables as well as jointly over all the variables,
i.e., in total we have 27 x (n + 1) candidate kernels, where n is the number of variables. All
kernel matrices were normalized to unit trace, and were precomputed prior to running the algorithms.
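The kernel bookkeeping described above can be sketched as follows. The Gaussian parametrization $\exp(-\|x - x'\|^2 / (2\sigma^2))$ is an assumption (the text does not state how the bandwidth enters the kernel), and the data and helper names are purely illustrative.

```python
import math

def gaussian_gram(X, width):
    """Gram matrix of exp(-||a - b||^2 / (2 width^2)); parametrization assumed."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * width ** 2)) for b in X] for a in X]

def unit_trace(K):
    """Rescale a Gram matrix so that its trace equals 1."""
    trace = sum(K[i][i] for i in range(len(K)))
    return [[k / trace for k in row] for row in K]

def num_candidate_kernels(n_variables, n_kernel_functions=27):
    """Each kernel function is applied to every variable and to all jointly."""
    return n_kernel_functions * (n_variables + 1)

# 24 Gaussian bandwidths: 0.1, 0.25, 0.5, 0.75 and the integers 1 to 20.
widths = [0.1, 0.25, 0.5, 0.75] + [float(w) for w in range(1, 21)]

X = [[0.0, 1.0], [1.0, 0.5], [2.0, -1.0]]      # hypothetical toy inputs
K = unit_trace(gaussian_gram(X, widths[4]))
total = num_candidate_kernels(n_variables=6)   # e.g. a dataset with 6 variables
```

Unit-trace normalization puts all candidate kernels on a comparable scale, so a single regularization parameter C acts on them uniformly.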

For SpicyMKL, we report the results for two loss functions, the hinge loss and the logistic
loss. For SimpleMKL and HessianMKL, we used the hinge loss. All methods were
implemented in Matlab. For SimpleMKL and HessianMKL, we used the Matlab codes available from
http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html and
http://olivier.chapelle.cc/ams/ respectively.

For each dataset, we randomly chose 80 % of all sample points as training samples and
the remaining 20 % were used as test samples. This procedure was repeated 10 times.
Experiments were run with 3 different regularization parameters C = 0.005, 0.05 and 0.5. We
converted the regularization parameter C of our formulation (2) to that for SimpleMKL and
HessianMKL by Eq. (13). We employed a stopping criterion utilizing the relative duality gap,
(primal obj - dual obj)/primal obj, for all algorithms, with tolerance 0.01. The primal
objective for SpicyMKL can be computed by using $\alpha^{(t)}$ and $b^{(t)}$. In order to compute the
dual objective, we first project $\rho$ to the $\ell_\infty$ ball by
$\rho' = \rho / \max\{\max_m\{\|\rho\|_{K_m}/C\}, 1\}$ and next project to the equality constraint
by $\tilde{\rho} = \rho' - \mathbf{1}(\sum_i \rho'_i)/N$. Then we compute the dual objective function
of SpicyMKL as $-f_\ell^*(-\tilde{\rho})$. The same technique can be found in [19].
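The two projection steps and the stopping criterion can be sketched as follows, under the assumption of list-based Gram matrices; `feasible_dual` and `relative_gap` are hypothetical names, and the primal and dual values in the example are made-up numbers used only to exercise the criterion.

```python
import math

def mat_vec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def k_norm(K, v):
    return math.sqrt(max(sum(a * b for a, b in zip(v, mat_vec(K, v))), 0.0))

def feasible_dual(rho, Ks, C):
    """Rescale rho so that max_m ||rho||_{K_m} <= C, then center it."""
    scale = max(max(k_norm(K, rho) for K in Ks) / C, 1.0)
    scaled = [r / scale for r in rho]
    mean = sum(scaled) / len(scaled)
    return [r - mean for r in scaled]      # entries now sum to zero

def relative_gap(primal, dual):
    """(primal obj - dual obj) / primal obj."""
    return (primal - dual) / primal

Ks = [[[1.0, 0.0], [0.0, 1.0]]]
rho_tilde = feasible_dual([0.9, -0.3], Ks, C=0.5)
converged = relative_gap(primal=2.0, dual=1.99) < 0.01
```

The dual objective is then evaluated at the projected vector, and the outer loop stops once the relative gap falls below the tolerance.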

The performance of each method is summarized in Figure 1. From top to bottom, the panels show
the means of CPU time, test accuracy, and the number of kernels finally selected by the
algorithms, with standard deviations over 10 trials. We can see that SpicyMKL tends to be
faster than SimpleMKL (by a factor of 5 to 80), and faster than HessianMKL when the number of
kernels M is large. In all datasets, SpicyMKL becomes faster as C increases. This is because
as the regularization becomes stronger the number of active kernels during the optimization
decreases at a faster rate. Accuracies of all methods are nearly identical. This indicates that
SpicyMKL properly converges to the optimal solution. SpicyMKL using the logistic loss tends to
require less CPU time than SpicyMKL using the hinge loss. This is because, by the strict convexity
of the conjugate function of the logistic loss, the Newton method in the inner loop (minimization
of $\varphi_\gamma$ with respect to $\rho$) converges faster than with the hinge loss. An interesting
point is that although the logistic loss is often faster to train, the accuracy is nearly identical
to that of the hinge loss. Moreover, when C is large (strong regularization), the solution under
the logistic loss tends to be sparser than that under the hinge loss. For the hinge loss, the
number of kernels selected by SpicyMKL is almost the same as that selected by SimpleMKL.

^4 All the experiments were executed on Intel Core i7 2.93GHz with 6GB RAM.

Figure 2(a) contains plots of the relative duality gaps of SpicyMKL (with hinge loss) and
SimpleMKL against CPU time, both on the 'Ionosphere' dataset. We can see that the duality gap
of SpicyMKL drops rapidly. Figure 2(b) shows the number of kernels as a function of the CPU
time spent by the algorithm. Here we again observe a rapid decrease in the number of kernels in
SpicyMKL. This greatly reduces the amount of computation per iteration.



5.2 Scaling against the sample size and the number of kernels 

Here we investigate the scaling of CPU time against the number of kernels and the sample
size. We used 2 datasets from the IDA benchmark repository^5: 'Ringnorm' and 'Splice'. The same
relative duality gap criterion with tolerance 0.01 is used. We generated the basis kernels by
randomly selecting subsets of features and applying a Gaussian kernel with random width
= + 0.1, where is a chi-squared random variable. In Figure 3 the number of kernels
is increased from 50 to 6000. The vertical axis shows the CPU time averaged over 10 random
train-test splits where the size of the training set was fixed to 200. We observe that the CPU
time of HessianMKL is the smallest for a small number of kernels, but it grows rapidly as the
number of kernels increases. On the other hand, the CPU time of SpicyMKL has a milder dependency
on the number of kernels. In particular, SpicyMKL is tens of times faster than SimpleMKL and
HessianMKL when the number of kernels is 6000. In Figure 4 the number of training samples
is increased. The number of kernels is fixed to 20. The scaling behaviour of the CPU time of
SpicyMKL is comparable to that of the other methods.

^5 http://ida.first.fhg.de/projects/bench/benchmarks.htm
Figure 2: Duality gap and number of active kernels against CPU time

Figure 3: CPU time as a function of the number of kernels



6 Conclusion and future direction 

In this article, we have proposed a new efficient training algorithm for MKL with general
convex loss functions. The proposed SpicyMKL algorithm generates a sequence of primal variables
by iteratively optimizing a smoothed version of the dual MKL problem. The outer loop of
SpicyMKL is a proximal minimization method and it converges super-linearly. The inner
minimization is efficiently carried out by the Newton method. The numerical experiments show that
SpicyMKL scales well with an increasing number of kernels and that its scaling behaviour
against the number of samples is similar to that of conventional methods. The logistic-loss
SpicyMKL has shown the best computational efficiency and improved sparsity at a comparable test
accuracy. Future work includes a second order modification of the update rule of the primal
variables and applying the techniques used in SpicyMKL to upper-bound-based methods.

Figure 4: CPU time as a function of the sample size



A Derivation of the update equations 

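We start from the min-max equation (Eq. (4)) and derive the SpicyMKL update equations
(Eqs. (5)-(7)). First, by exchanging the order of minimization and maximization and completing
the squares, we obtain

$$\text{Eq. (4)} = \max_{\rho \in \mathbb{R}^N,\, u \in \mathbb{R}^{NM}} \Bigg\{ -f_\ell^*(-\rho) - \sum_{m=1}^{M} \Big( \phi^*_{CK_m}(K_m u_m) + \alpha_m^{(t)\top} K_m (\rho - u_m) + \frac{\gamma_m^{(t)}}{2} \|\rho - u_m\|_{K_m}^2 \Big) - b^{(t)} \sum_{i=1}^{N} \rho_i - \frac{\gamma_b^{(t)}}{2} \Big(\sum_{i=1}^{N} \rho_i\Big)^2 \Bigg\}. \tag{14}$$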



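Furthermore, by turning the maximization into a minimization and moving the minimization
with respect to $u_m$ inside, we have

$$\text{Eq. (4)} = -\min_{\rho \in \mathbb{R}^N} \Bigg\{ f_\ell^*(-\rho) + \sum_{m=1}^{M} \Phi_m(\rho) + b^{(t)} \sum_{i=1}^{N} \rho_i + \frac{\gamma_b^{(t)}}{2} \Big(\sum_{i=1}^{N} \rho_i\Big)^2 \Bigg\}, \tag{15}$$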



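where

$$\Phi_m(\rho) = \min_{u'_m \in \mathbb{R}^N} \Big( \phi^*_{CK_m}(K_m u'_m / \gamma_m^{(t)}) + \frac{1}{2\gamma_m^{(t)}} \big\|u'_m - (\alpha_m^{(t)} + \gamma_m^{(t)} \rho)\big\|_{K_m}^2 \Big) + \text{const}$$

(we redefined $\gamma_m u_m$ as $u'_m$ for notational convenience) and const is a term that only
depends on $\gamma^{(t)}$ and $\alpha^{(t)}$. Now since $\phi_{CK_m}(v) = C\|v\|_{K_m}$, we have

$$\phi^*_{CK_m}(K_m u) = \begin{cases} 0 & (\|u\|_{K_m} \le C), \\ +\infty & (\text{otherwise}). \end{cases}$$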



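Let $v_m = \alpha_m^{(t)} + \gamma_m^{(t)} \rho$. From a simple geometric consideration, the minimizer
$u'_m$ is the projection of $v_m$ onto the ball of radius $\gamma_m^{(t)} C$, i.e.,

$$u'_m = v_m \, \frac{\min(\|v_m\|_{K_m},\, \gamma_m^{(t)} C)}{\|v_m\|_{K_m}}.$$

Thus we obtain

$$\Phi_m(\rho) = \frac{1}{2\gamma_m^{(t)}} \big\|\mathrm{ST}^{m}_{\gamma_m^{(t)} C}(\alpha_m^{(t)} + \gamma_m^{(t)} \rho)\big\|_{K_m}^2 + \text{const}. \tag{16}$$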



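We obtain Eq. (7) by substituting Eq. (16) into Eq. (15) and rearranging terms. Furthermore,
from the minimization over $(\alpha, b)$ that leads to Eq. (14) we have:

$$\alpha_m^{(t+1)} = \alpha_m^{(t)} + \gamma_m^{(t)} \rho^{(t)} - u'^{(t)}_m = \mathrm{ST}^{m}_{\gamma_m^{(t)} C}\big(\alpha_m^{(t)} + \gamma_m^{(t)} \rho^{(t)}\big), \qquad b^{(t+1)} = b^{(t)} + \gamma_b^{(t)} \sum_{i=1}^{N} \rho_i^{(t)},$$

where $\rho^{(t)}$ is the minimizer in Eq. (15) or Eq. (7).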
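The "completing the squares" step that takes Eq. (4) to Eq. (14) can be spelled out for the two primal variables separately. The following is a worked sketch; the shorthands $s = \sum_i \rho_i$ and $w_m = \rho - u_m$ are introduced here for brevity and do not appear in the original derivation.

```latex
\begin{align*}
\min_{b}\Big\{ -b\,s + \frac{(b - b^{(t)})^2}{2\gamma_b^{(t)}} \Big\}
  &\;\Longrightarrow\; b = b^{(t)} + \gamma_b^{(t)} s,
  &&\text{value: } -b^{(t)} s - \frac{\gamma_b^{(t)}}{2}\, s^2, \\
\min_{\alpha_m}\Big\{ -\alpha_m^\top K_m w_m
      + \frac{\|\alpha_m - \alpha_m^{(t)}\|_{K_m}^2}{2\gamma_m^{(t)}} \Big\}
  &\;\Longrightarrow\; \alpha_m = \alpha_m^{(t)} + \gamma_m^{(t)} w_m,
  &&\text{value: } -\alpha_m^{(t)\top} K_m w_m - \frac{\gamma_m^{(t)}}{2}\, \|w_m\|_{K_m}^2 .
\end{align*}
```

Substituting both minimum values into Eq. (4) gives the terms of Eq. (14), and the two minimizers are exactly the updates for $b^{(t+1)}$ and $\alpha_m^{(t+1)}$ stated above.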






References 

[1] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the 
SMO algorithm, 2004. 

[2] F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine 
Learning Research, 9:1179-1225, 2008. 

[3] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic 
Press, 1982. 

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. 

[5] O. Chapelle and A. Rakotomamonjy. Second order optimization of kernel parameters.
In NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal Kernels,
Whistler, 2008.

[6] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting.
Multiscale Modeling and Simulation, 4(4):1168-1200, 2005.

[7] I. Daubechies, M. Defrise, and C. De Mol. An Iterative Thresholding Algorithm for Linear
Inverse Problems with a Sparsity Constraint. Communications on Pure and Applied
Mathematics, LVII:1413-1457, 2004.

[8] M. Figueiredo and R. Nowak. An EM algorithm for wavelet-based image restoration. IEEE
Trans. Image Process., 12:906-916, 2003.

[9] M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory &
Applications, 4:303-320, 1969.

[10] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal
of Mathematical Analysis and Applications, 33:82-95, 1971.

[11] G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel 
matrix with semi-definite programming. Journal of Machine Learning Research, 5:27-72, 
2004. 

[12] M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher,
editor, Optimization, pages 283-298. Academic Press, London, New York, 1969.

[13] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine
Learning Research, 9:2491-2521, 2008.

[14] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.

[15] R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm
in convex programming. Math. of Oper. Res., 1:97-116, 1976.

[16] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines,
Regularization, Optimization and Beyond. MIT Press, Cambridge, MA, 2002.

[17] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel
learning. Journal of Machine Learning Research, 7:1531-1565, 2006.

[18] R. Tomioka and M. Sugiyama. Dual Augmented Lagrangian Method for Efficient Sparse 
Reconstruction, 2009. 






[19] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse Reconstruction by Separable 
Approximation. IEEE Trans. Signal Process., 2009. 

[20] Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple 
kernel learning, pages 1825-1832, 2009. 

[21] W. Yin, S. Osher, D. Goldfarb, and J. Darbon. Bregman Iterative Algorithms for
L1-Minimization with Applications to Compressed Sensing. SIAM J. Imaging Sciences,
1(1):143-168, 2008.

[22] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of The Royal Statistical Society Series B, 68(1):49-67, 2006.






